1. Dataset Selection and Biological Justification

1.1 Brief introduction of GSE205677

This study, “Saturated Fat Impairs Circadian Transcriptomics through Histone Modification of Enhancers”, investigates how acute lipid overload affects circadian gene expression in human skeletal muscle. Primary skeletal muscle myotubes were derived from biopsies of healthy men, synchronized to a common circadian phase, and exposed to the saturated fatty acid palmitate. Cells were sampled every six hours over a 42-hour period, and gene expression was measured using bulk RNA-sequencing.

The dataset focuses on transcriptome-wide changes in circadian rhythmicity rather than disruption of core clock genes themselves. The authors show that palmitate exposure reprograms rhythmic gene expression and alters the cycling of histone H3 lysine 27 acetylation (H3K27ac), a chromatin mark associated with active enhancers. This provides a mechanistic link between lipid metabolism, epigenetic regulation, and circadian transcription in human skeletal muscle.

This dataset is well suited for differential expression and exploratory circadian analysis due to its controlled experimental design, time-resolved sampling, and availability of raw gene-level RNA-seq count data.

2. Data Cleaning and Gene Identifier Mapping

2.1 Initial data inspection and quality assessment

Include and install all dependency

Downloading Datasets

# Create data directory if it doesn't exist
dir.create("data", showWarnings = FALSE)

# GSE205677 (NGT control + palmitate-treated dataset)
file_205677 <- "data/GSE205677_counts.tsv.gz"
url_205677  <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE205nnn/GSE205677/suppl/GSE205677_counts.tsv.gz"

if (!file.exists(file_205677)) {
  download.file(url_205677, destfile = file_205677, mode = "wb")
}

# Verify file exists
file.exists(file_205677)
## [1] TRUE

Data Created and inspect dimension

# Read raw count matrix (GSE205677)
counts <- read.table(
  "data/GSE205677_counts.tsv.gz",
  header = TRUE,
  row.names = 1,
  sep = "\t",
  check.names = FALSE
)

# Record dataset information
data <- list(
  data = counts,
  description = paste(
    "Raw bulk RNA-seq gene-level count matrix from GSE205677.",
    "Rows correspond to Ensembl gene identifiers.",
    "Columns correspond to individual RNA-seq libraries from",
    "primary human skeletal muscle myotubes with normal glucose tolerance (NGT).",
    "Samples include both control (NGT_CTL) and palmitate-treated (NGT_PAL)",
    "conditions, collected across multiple time points and biological replicates."
  ),
  processing = c(
    "RNA sequencing",
    "read alignment",
    "gene-level read counting"
  )
)

# dim check
dim(data$data)
## [1] 16229   105

Here the data was stored along with a list documenting process has been done to it.

# Inspect column headers
head(colnames(data$data), 10)
##  [1] "NGT_CTL_12h_i01" "NGT_CTL_18h_i01" "NGT_CTL_30h_i01" "NGT_CTL_36h_i01"
##  [5] "NGT_CTL_42h_i01" "NGT_CTL_48h_i01" "NGT_CTL_54h_i01" "NGT_PAL_12h_i01"
##  [9] "NGT_PAL_18h_i01" "NGT_PAL_24h_i01"
# Inspect first few rows of the count matrix
head(data$data, 5)
##                 NGT_CTL_12h_i01 NGT_CTL_18h_i01 NGT_CTL_30h_i01 NGT_CTL_36h_i01
## ENSG00000227232               1               9               8               1
## ENSG00000279457              19              12              17              14
## ENSG00000237973              13              12              34              14
## ENSG00000248527              73              51              57              33
## ENSG00000228327               6               6               8               0
##                 NGT_CTL_42h_i01 NGT_CTL_48h_i01 NGT_CTL_54h_i01 NGT_PAL_12h_i01
## ENSG00000227232              18               3              23               8
## ENSG00000279457              65              30              33              18
## ENSG00000237973              78              53              42              17
## ENSG00000248527              33              52              40              78
## ENSG00000228327               9              13               8               8
##                 NGT_PAL_18h_i01 NGT_PAL_24h_i01 NGT_PAL_30h_i01 NGT_PAL_36h_i01
## ENSG00000227232               5               9              27               9
## ENSG00000279457               5              51              56              54
## ENSG00000237973               8              42              55              57
## ENSG00000248527              64              72              48              51
## ENSG00000228327               5               8              11              11
##                 NGT_PAL_42h_i01 NGT_PAL_48h_i01 NGT_PAL_54h_i01 NGT_CTL_12h_i04
## ENSG00000227232               7               5              22              17
## ENSG00000279457              29              36              48              51
## ENSG00000237973              37              38              48              83
## ENSG00000248527              47              51              36             143
## ENSG00000228327               6               5               9              15
##                 NGT_CTL_18h_i04 NGT_CTL_24h_i04 NGT_CTL_30h_i04 NGT_CTL_36h_i04
## ENSG00000227232              23              26              27              20
## ENSG00000279457              42              46              35              15
## ENSG00000237973              97              98              57              61
## ENSG00000248527             143             127              74              34
## ENSG00000228327              23               9              15              16
##                 NGT_CTL_42h_i04 NGT_CTL_48h_i04 NGT_CTL_54h_i04 NGT_PAL_12h_i04
## ENSG00000227232               8               7              13              12
## ENSG00000279457              23              12              28              36
## ENSG00000237973              55              45              40              50
## ENSG00000248527              63              43              40              62
## ENSG00000228327               7              14              14              13
##                 NGT_PAL_18h_i04 NGT_PAL_24h_i04 NGT_PAL_30h_i04 NGT_PAL_36h_i04
## ENSG00000227232               1              18              27              34
## ENSG00000279457               2              27              33              58
## ENSG00000237973               6              63              86              66
## ENSG00000248527              19             104             109              94
## ENSG00000228327               3              15              30              33
##                 NGT_PAL_42h_i04 NGT_PAL_48h_i04 NGT_CTL_12h_i12 NGT_CTL_24h_i12
## ENSG00000227232              12              13               8               2
## ENSG00000279457              13              41              35              32
## ENSG00000237973              24              88              22              16
## ENSG00000248527              42             101              44              34
## ENSG00000228327               2              20               4               7
##                 NGT_CTL_30h_i12 NGT_CTL_36h_i12 NGT_CTL_42h_i12 NGT_CTL_48h_i12
## ENSG00000227232              11              10               8               9
## ENSG00000279457              31              58              38              31
## ENSG00000237973              19              38              63              19
## ENSG00000248527              33              43              73              24
## ENSG00000228327              11              14              10              14
##                 NGT_CTL_54h_i12 NGT_PAL_12h_i12 NGT_PAL_18h_i12 NGT_PAL_24h_i12
## ENSG00000227232              33              17              16               6
## ENSG00000279457             155              48              65              62
## ENSG00000237973              60              14              42              93
## ENSG00000248527             130              38              53             179
## ENSG00000228327              19               9              15              20
##                 NGT_PAL_30h_i12 NGT_PAL_36h_i12 NGT_PAL_42h_i12 NGT_PAL_48h_i12
## ENSG00000227232               5               2              11               7
## ENSG00000279457              45              44              54              34
## ENSG00000237973              33              22              51              12
## ENSG00000248527              62              27              42              26
## ENSG00000228327               4               4               2               6
##                 NGT_PAL_54h_i12 NGT_CTL_12h_i08 NGT_CTL_18h_i08 NGT_CTL_24h_i08
## ENSG00000227232               1              15               5               4
## ENSG00000279457              10              44               5              20
## ENSG00000237973               7              35              10               9
## ENSG00000248527               3              45              18              11
## ENSG00000228327               2              42               6               3
##                 NGT_CTL_30h_i08 NGT_CTL_36h_i08 NGT_CTL_42h_i08 NGT_CTL_48h_i08
## ENSG00000227232              30              26              31               3
## ENSG00000279457              40              52              43              15
## ENSG00000237973              38              36              39              21
## ENSG00000248527              39              27              22              17
## ENSG00000228327              15              17              29               3
##                 NGT_CTL_54h_i08 NGT_PAL_12h_i08 NGT_PAL_18h_i08 NGT_PAL_24h_i08
## ENSG00000227232              33               5               3              13
## ENSG00000279457              31              16               2              36
## ENSG00000237973              32               9               9              14
## ENSG00000248527              50              31              12              25
## ENSG00000228327              15               7               4              24
##                 NGT_PAL_30h_i08 NGT_PAL_36h_i08 NGT_PAL_42h_i08 NGT_PAL_48h_i08
## ENSG00000227232              18              16              15              39
## ENSG00000279457              24              46              56              54
## ENSG00000237973              28              41              36              65
## ENSG00000248527              36              25              37              29
## ENSG00000228327              24              29              30              33
##                 NGT_PAL_54h_i08 NGT_CTL_12h_i09 NGT_CTL_18h_i09 NGT_CTL_24h_i09
## ENSG00000227232              27               0               0               2
## ENSG00000279457              47               1               0               0
## ENSG00000237973              36               0               2               3
## ENSG00000248527              10              12               5               0
## ENSG00000228327              14               0               0               0
##                 NGT_CTL_30h_i09 NGT_CTL_36h_i09 NGT_CTL_48h_i09 NGT_CTL_54h_i09
## ENSG00000227232               0               3               2               3
## ENSG00000279457               2               2               4               6
## ENSG00000237973               1               1               3               1
## ENSG00000248527               2               1               2               1
## ENSG00000228327               3               1               0               2
##                 NGT_PAL_18h_i09 NGT_PAL_24h_i09 NGT_PAL_30h_i09 NGT_PAL_36h_i09
## ENSG00000227232               0               5               2               4
## ENSG00000279457               0               7               3               0
## ENSG00000237973               0               1               1              10
## ENSG00000248527              20               3               4              12
## ENSG00000228327               0               0               2               0
##                 NGT_PAL_42h_i09 NGT_PAL_48h_i09 NGT_PAL_54h_i09 NGT_CTL_12h_i10
## ENSG00000227232               0               2              14              21
## ENSG00000279457               1               1               7              34
## ENSG00000237973               4               3              13              97
## ENSG00000248527               2               3              13              71
## ENSG00000228327               1               0               3              17
##                 NGT_CTL_18h_i10 NGT_CTL_24h_i10 NGT_CTL_30h_i10 NGT_CTL_36h_i10
## ENSG00000227232               3              15              20              18
## ENSG00000279457              19              32              22              23
## ENSG00000237973              58              57              81              52
## ENSG00000248527              68              86              97             100
## ENSG00000228327               9               7               4              14
##                 NGT_CTL_42h_i10 NGT_CTL_48h_i10 NGT_CTL_54h_i10 NGT_PAL_12h_i10
## ENSG00000227232              26              15              25               6
## ENSG00000279457              32              28              28              18
## ENSG00000237973              66             106             131              41
## ENSG00000248527              56              97             114               6
## ENSG00000228327              16              22              26              11
##                 NGT_PAL_18h_i10 NGT_PAL_24h_i10 NGT_PAL_30h_i10 NGT_PAL_36h_i10
## ENSG00000227232              12              13              16              25
## ENSG00000279457              39              26              57              37
## ENSG00000237973              99              70              73             109
## ENSG00000248527             133              72              80             133
## ENSG00000228327              18              11              30              22
##                 NGT_PAL_42h_i10 NGT_PAL_48h_i10 NGT_CTL_12h_i11 NGT_CTL_18h_i11
## ENSG00000227232              17              13               2               2
## ENSG00000279457              24              45               2               3
## ENSG00000237973              85             115               9               4
## ENSG00000248527              77              81              13               5
## ENSG00000228327               7              22               5               0
##                 NGT_CTL_24h_i11 NGT_CTL_30h_i11 NGT_CTL_36h_i11 NGT_CTL_42h_i11
## ENSG00000227232              18               8               5               8
## ENSG00000279457              15              14               4              23
## ENSG00000237973              36              23               9              28
## ENSG00000248527              16              16              12              20
## ENSG00000228327               6               4               4               5
##                 NGT_CTL_48h_i11 NGT_CTL_54h_i11 NGT_PAL_18h_i11 NGT_PAL_24h_i11
## ENSG00000227232               2              21               0               1
## ENSG00000279457              26              21               8               2
## ENSG00000237973              20              41               7              10
## ENSG00000248527              27              23             139             105
## ENSG00000228327               4               4               7               2
##                 NGT_PAL_30h_i11 NGT_PAL_36h_i11 NGT_PAL_42h_i11 NGT_PAL_48h_i11
## ENSG00000227232              15               4              13              11
## ENSG00000279457              40              11              22              25
## ENSG00000237973              31              20              27              34
## ENSG00000248527              38              22              18              16
## ENSG00000228327               8               7               7              10
##                 NGT_PAL_54h_i11
## ENSG00000227232              42
## ENSG00000279457              75
## ENSG00000237973              38
## ENSG00000248527              22
## ENSG00000228327              26
# Record inspection step in processing history
data$processing <- c(
  data$processing,
  "inspection"
)

# View updated processing record
data$processing
## [1] "RNA sequencing"           "read alignment"          
## [3] "gene-level read counting" "inspection"

Due to the data-set contains same cell line in different time points, we inspect the data structure further:

# Extract sample names
samples <- colnames(data$data)

# Define condition based on column names
condition <- ifelse(grepl("_PAL_", samples), "Treatment", "Control")

table(condition)
## condition
##   Control Treatment 
##        53        52

One can see this data set actually contains 5-7 sample per time x cell line x treatment

# Extract time point (e.g. 12h, 18h, ...)
timepoint <- sub(".*_([0-9]+h)_i[0-9]+$", "\\1", samples)

table(timepoint)
## timepoint
## 12h 18h 24h 30h 36h 42h 48h 54h 
##  12  13  13  14  14  13  14  12
table(condition, timepoint)
##            timepoint
## condition   12h 18h 24h 30h 36h 42h 48h 54h
##   Control     7   6   6   7   7   6   7   7
##   Treatment   5   7   7   7   7   7   7   5

Inspect library size overall.

# Total reads per sample
library_size <- colSums(data$data)

summary(library_size)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  1528275 10172642 19278178 17977321 23966538 44495793

Inspect library size group by time and treatment

sample_info <- data.frame(
  sample     = samples,
  condition  = condition,
  timepoint  = timepoint,
  library_size = library_size,
  stringsAsFactors = FALSE
)

library_summary <- aggregate(
  library_size ~ condition + timepoint,
  data = sample_info,
  FUN = function(x) c(
    n = length(x),
    mean = mean(x),
    median = median(x),
    min = min(x),
    max = max(x)
  )
)
boxplot(
  library_size ~ condition + timepoint,
  data = sample_info,
  las = 2,
  ylab = "Total read count per library",
  main = "Library size by condition and time point"
)

Figure 1. Library size distribution across condition and time point

Boxplots show the total sequencing depth (library size) for each RNA-seq sample, grouped by treatment condition (CTL vs PAL) and circadian time point. Overall library sizes are comparable across conditions and time points, indicating no major systematic differences in sequencing depth prior to normalization.

2.2 Gene identifier mapping to HUGO symbols

Include ensembl at fix version 115 for reproducebilty. We restric it to hsapiens_gene_ensembl data.

library(biomaRt)
ensembl_version <- 115
ensembl <- useEnsembl(biomart = "ensembl",version = ensembl_version)
ensembl <- useDataset(
  "hsapiens_gene_ensembl",
  mart = ensembl
)

Preparation of ids by strip version suffix, inspect the stripped raw ID’s:

# Extract Ensembl gene IDs from count matrix
ensembl_ids <- rownames(data$data)

# Strip version suffix if present (safe even if none exist)
strip_ensembl_version <- function(x) sub("\\..*$", "", x)
ensembl_ids_clean <- strip_ensembl_version(ensembl_ids)

length(ensembl_ids_clean)
## [1] 16229
head(ensembl_ids_clean)
## [1] "ENSG00000227232" "ENSG00000279457" "ENSG00000237973" "ENSG00000248527"
## [5] "ENSG00000228327" "ENSG00000237491"

Create Mapping

gene_map <- getBM(
  attributes = c("ensembl_gene_id", "hgnc_symbol"),
  filters    = "ensembl_gene_id",
  values     = ensembl_ids_clean,
  mart       = ensembl
)

dim(gene_map)
## [1] 16063     2
head(gene_map)
##   ensembl_gene_id hgnc_symbol
## 1 ENSG00000000457       SCYL3
## 2 ENSG00000000460       FIRRM
## 3 ENSG00000000971         CFH
## 4 ENSG00000001460       STPG1
## 5 ENSG00000001461      NIPAL3
## 6 ENSG00000001617      SEMA3F

Cache mapping

cache_file <- "data/ensembl_to_hgnc_map.rds"

if (!file.exists(cache_file)) {
  saveRDS(gene_map, cache_file)
} else {
  gene_map <- readRDS(cache_file)
}

2.3 Handling unmapped and duplicated identifiers

Here possible conflicts is handled by the following rule:

# Convert counts to data frame with gene IDs
counts_df <- as.data.frame(data$data)
counts_df$ensembl_gene_id <- ensembl_ids_clean
  1. Ensembl IDs mapping to multiple symbols (ambiguous) These ids will be assigned na.
ambiguous_ensembl <- gene_map %>%
  group_by(ensembl_gene_id) %>%
  summarise(n_symbols = n_distinct(hgnc_symbol)) %>%
  filter(n_symbols > 1)

ambiguous_ensembl
## # A tibble: 1 × 2
##   ensembl_gene_id n_symbols
##   <chr>               <int>
## 1 ENSG00000230417         2

Here we see only 1 gene has mapped to 2 symbols.

  1. Multiple Ensembl IDs mapping to the same symbol These will be kept sperate but given same symbol.
multi_ensembl_per_symbol <- gene_map %>%
  filter(hgnc_symbol != "") %>%
  group_by(hgnc_symbol) %>%
  summarise(n_genes = n_distinct(ensembl_gene_id)) %>%
  filter(n_genes > 1)

multi_ensembl_per_symbol
## # A tibble: 1 × 2
##   hgnc_symbol n_genes
##   <chr>         <int>
## 1 POLR2J3           2

Here we see 1 symbol get mapped to 2 rows.

Merge the annotation back to the data.

# Identify ambiguous Ensembl IDs
ambiguous_ids <- ambiguous_ensembl$ensembl_gene_id

# Clean gene_map
gene_map_clean <- gene_map %>%
  mutate(
    hgnc_symbol = ifelse(
      ensembl_gene_id %in% ambiguous_ids,
      NA,
      hgnc_symbol
    )
  ) %>%
  distinct(ensembl_gene_id, hgnc_symbol)

# Join HGNC symbols
counts_annotated <- counts_df %>%
  left_join(gene_map, by = "ensembl_gene_id")

# Reorder columns: annotation first
counts_annotated <- counts_annotated %>%
  relocate(ensembl_gene_id, hgnc_symbol)

# Record back into main data object
data$data <- counts_annotated

# Record processing step
data$processing <- c(
  data$processing,
  "Ensembl gene Mapping"
)

head(data$data[, 1:6])
##   ensembl_gene_id hgnc_symbol NGT_CTL_12h_i01 NGT_CTL_18h_i01 NGT_CTL_30h_i01
## 1 ENSG00000227232      WASH7P               1               9               8
## 2 ENSG00000279457      WASH9P              19              12              17
## 3 ENSG00000237973    MTCO1P12              13              12              34
## 4 ENSG00000248527    MTATP6P1              73              51              57
## 5 ENSG00000228327                           6               6               8
## 6 ENSG00000237491   LINC01409              23              16              43
##   NGT_CTL_36h_i01
## 1               1
## 2              14
## 3              14
## 4              33
## 5               0
## 6              10

Check for number of duplication in annotation after annotation. This duplication is larger because it also counts for unmapped genes.

# total genes symbols
table(!is.na(counts_annotated$hgnc_symbol))
## 
## FALSE  TRUE 
##   167 16063
# duplicated symbols
sum(duplicated(counts_annotated$hgnc_symbol, na.rm = TRUE))
## [1] 982

2.4 Outliners

Given the robustness of downstream RNA-seq statistical methods to extreme values, no outliner removal was performed at this stage in the absence of independent evidence of technical failure.

Here we name the rows by handle the hgnc symbol conflict by give it a version number and do a minimum 10 count in 10 samples cutoff:

# Prefer HGNC symbol, fall back to Ensembl if missing
gene_label <- ifelse(
  !is.na(data$data$hgnc_symbol) & data$data$hgnc_symbol != "",
  data$data$hgnc_symbol,
  data$data$ensembl_gene_id
)

gene_label_unique <- make.unique(gene_label)

rownames(data$data) <- gene_label_unique

# Identify count columns (exclude annotation)
count_cols <- !(colnames(data$data) %in% c("ensembl_gene_id", "hgnc_symbol"))

counts_mat <- as.matrix(data$data[, count_cols])

# Define filtering threshold
min_count   <- 10
min_samples <- 10

keep_genes <- rowSums(counts_mat >= min_count) >= min_samples

table(keep_genes)
## keep_genes
##  TRUE 
## 16230
# Filter the full data object (annotation + counts)
data$data_filtered <- data$data[keep_genes, ]


dim(data$data_filtered)
## [1] 16230   107
data$processing <- c(
  data$processing,
  paste0(
    "Low-count gene filtering: retained genes with ≥ ",
    min_count, " counts in ≥ ",
    min_samples, " samples"
  )
)

data$processing
## [1] "RNA sequencing"                                                           
## [2] "read alignment"                                                           
## [3] "gene-level read counting"                                                 
## [4] "inspection"                                                               
## [5] "Ensembl gene Mapping"                                                     
## [6] "Low-count gene filtering: retained genes with ≥ 10 counts in ≥ 10 samples"

3. Quality Control and Normalization

3.1 Sample metadata construction and experimental design

we first plot the pre normalization distribution:

Define plot functions:

plot_box <- function(mat, main = "") {
  boxplot(
    log2(mat + 1),
    outline = FALSE,
    las = 2,
    main = main,
    ylab = "log2(count + 1)"
  )
}

plot_density <- function(mat, main = "") {
  mat <- log2(mat + 1)
  dens <- apply(mat, 2, density)
  
  plot(dens[[1]], col = "grey50", main = main, xlab = "log2(count + 1)")
  for (i in 2:length(dens)) {
    lines(dens[[i]], col = "grey50")
  }
}

Define time point grouping function:

split_by_timepoint <- function(count_matrix, sample_names = colnames(count_matrix)) {
  
  # Extract time point from sample names (e.g. 12h, 18h, ...)
  timepoint <- sub(".*_([0-9]+h)_i[0-9]+$", "\\1", sample_names)
  
  # Split sample names by time point
  samples_by_tp <- split(sample_names, timepoint)
  
  # Create a list of sub-matrices
  count_by_tp <- lapply(samples_by_tp, function(samps) {
    count_matrix[, samps, drop = FALSE]
  })
  
  return(count_by_tp)
}
inspect_timepoint_counts <- function(counts_by_tp) {
  for (tp in names(counts_by_tp)) {
    message("Inspecting time point: ", tp)
    
    mat <- counts_by_tp[[tp]]
    
    cat("Dimensions:", dim(mat), "\n")
    print(summary(as.vector(mat)))
    
    plot_box(mat, main = paste("Raw counts –", tp))
    plot_density(mat, main = paste("Raw counts density –", tp))
  }
}

Here we plot for each time point the distribution of raw counts.

counts_raw <- as.matrix(data$data[, -c(1, 2)])  
samples <- colnames(counts_raw)
counts_by_timepoint <- split_by_timepoint(counts_raw)
inspect_timepoint_counts(counts_by_timepoint)
## Inspecting time point: 12h
## Dimensions: 16230 12 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      36     202    1150     775 1344105

## Inspecting time point: 18h
## Dimensions: 16230 13 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       0.0      14.0      65.0     812.4     326.0 2203445.0

## Inspecting time point: 24h
## Dimensions: 16230 13 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      27     152    1035     684 1338621

## Inspecting time point: 30h
## Dimensions: 16230 14 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      40     229    1246     869 1848962

## Inspecting time point: 36h
## Dimensions: 16230 14 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      38     201    1128     774 1790186

## Inspecting time point: 42h
## Dimensions: 16230 13 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      42     241    1169     838 2043978

## Inspecting time point: 48h
## Dimensions: 16230 14 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      38     209    1119     777 2105929

## Inspecting time point: 54h
## Dimensions: 16230 12 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      40     221    1198     819 2894113

Figure 2. Distribution of raw gene expression counts across time points

Boxplots and density plots display the distribution of raw RNA-seq gene counts for samples grouped by circadian time point. Substantial variability in count distributions reflects differences in sequencing depth and library composition prior to normalization. The high proportion of low-count and zero-count genes motivates subsequent filtering and normalization steps.

we extract design matrix from the naming rules:

## ---- design-matrix-from-names ----

# Extract sample names
samples <- colnames(data$data)

# Build sample metadata directly from column names
sample_data <- data.frame(
  sample    = samples,
  treatment = sub(".*_(CTL|PAL)_.*", "\\1", samples),
  timepoint = sub(".*_([0-9]+h)_.*", "\\1", samples),
  replicate = sub(".*_(i[0-9]+)$", "\\1", samples),
  stringsAsFactors = FALSE
)

# Define biological condition: treatment × timepoint
sample_data$condition <- paste(
  sample_data$treatment,
  sample_data$timepoint,
  sep = "_"
)

# Inspect true sample sizes per condition
table(sample_data$condition)
## 
##                         CTL_12h                         CTL_18h 
##                               7                               6 
##                         CTL_24h                         CTL_30h 
##                               6                               7 
##                         CTL_36h                         CTL_42h 
##                               7                               6 
##                         CTL_48h                         CTL_54h 
##                               7                               7 
## ensembl_gene_id_ensembl_gene_id         hgnc_symbol_hgnc_symbol 
##                               1                               1 
##                         PAL_12h                         PAL_18h 
##                               5                               7 
##                         PAL_24h                         PAL_30h 
##                               7                               7 
##                         PAL_36h                         PAL_42h 
##                               7                               7 
##                         PAL_48h                         PAL_54h 
##                               7                               5
# Construct design matrix
design <- model.matrix(~ 0 + condition, data = sample_data)
rownames(design) <- sample_data$sample
colnames(design) <- levels(factor(sample_data$condition))

# Inspect design matrix
head(design)
##                 CTL_12h CTL_18h CTL_24h CTL_30h CTL_36h CTL_42h CTL_48h CTL_54h
## ensembl_gene_id       0       0       0       0       0       0       0       0
## hgnc_symbol           0       0       0       0       0       0       0       0
## NGT_CTL_12h_i01       1       0       0       0       0       0       0       0
## NGT_CTL_18h_i01       0       1       0       0       0       0       0       0
## NGT_CTL_30h_i01       0       0       0       1       0       0       0       0
## NGT_CTL_36h_i01       0       0       0       0       1       0       0       0
##                 ensembl_gene_id_ensembl_gene_id hgnc_symbol_hgnc_symbol PAL_12h
## ensembl_gene_id                               1                       0       0
## hgnc_symbol                                   0                       1       0
## NGT_CTL_12h_i01                               0                       0       0
## NGT_CTL_18h_i01                               0                       0       0
## NGT_CTL_30h_i01                               0                       0       0
## NGT_CTL_36h_i01                               0                       0       0
##                 PAL_18h PAL_24h PAL_30h PAL_36h PAL_42h PAL_48h PAL_54h
## ensembl_gene_id       0       0       0       0       0       0       0
## hgnc_symbol           0       0       0       0       0       0       0
## NGT_CTL_12h_i01       0       0       0       0       0       0       0
## NGT_CTL_18h_i01       0       0       0       0       0       0       0
## NGT_CTL_30h_i01       0       0       0       0       0       0       0
## NGT_CTL_36h_i01       0       0       0       0       0       0       0

Also prepare matrix in TMM ready format:

## ---- prepare-filtered-counts ----

count_cols <- !(colnames(data$data_filtered) %in% c("ensembl_gene_id", "hgnc_symbol"))

counts_filtered <- as.matrix(
  data$data_filtered[, count_cols]
)

storage.mode(counts_filtered) <- "integer"

dim(counts_filtered)
## [1] 16230   105

3.2 Low-count gene filtering with Design

We further filtered with experiment design to see if any other QC can be done.

library(edgeR)

dge <- DGEList(counts = counts_filtered)

keep <- filterByExpr(
  dge,
  design = design,
  min.count = 10
)

table(keep)
## keep
##  TRUE 
## 16230
dge_filtered <- dge[keep, , keep.lib.sizes = FALSE]

dim(dge_filtered)
## [1] 16230   105

3.3 Normalization using TMM and CPM

Run the normalization using TMM

dge_filtered <- calcNormFactors(dge_filtered, method = "TMM")

dge_filtered$samples
##                 group lib.size norm.factors
## NGT_CTL_12h_i01     1 12579412    0.9401901
## NGT_CTL_18h_i01     1 11604458    0.8762113
## NGT_CTL_30h_i01     1 14018300    0.9742833
## NGT_CTL_36h_i01     1  6284707    0.5663289
## NGT_CTL_42h_i01     1 21050408    0.9677165
## NGT_CTL_48h_i01     1 22118180    0.9921210
## NGT_CTL_54h_i01     1 21015590    1.0150009
## NGT_PAL_12h_i01     1 15001968    0.9486170
## NGT_PAL_18h_i01     1  9378934    0.6510577
## NGT_PAL_24h_i01     1 22413991    0.9893454
## NGT_PAL_30h_i01     1 24117644    0.9676133
## NGT_PAL_36h_i01     1 23530382    0.9941203
## NGT_PAL_42h_i01     1 17498905    0.9910972
## NGT_PAL_48h_i01     1 17149330    0.9880251
## NGT_PAL_54h_i01     1 22115586    0.9872006
## NGT_CTL_12h_i04     1 35954433    1.0298082
## NGT_CTL_18h_i04     1 41526006    1.0001466
## NGT_CTL_24h_i04     1 34786423    1.0030701
## NGT_CTL_30h_i04     1 29763457    1.0615500
## NGT_CTL_36h_i04     1 17943959    1.0260868
## NGT_CTL_42h_i04     1 20917680    1.0336922
## NGT_CTL_48h_i04     1 15334398    1.0329340
## NGT_CTL_54h_i04     1 18670758    1.0651072
## NGT_PAL_12h_i04     1 23183424    1.0432817
## NGT_PAL_18h_i04     1  3234816    0.5734256
## NGT_PAL_24h_i04     1 23693310    1.0423516
## NGT_PAL_30h_i04     1 38046696    1.0712179
## NGT_PAL_36h_i04     1 34661015    1.0826845
## NGT_PAL_42h_i04     1 10932135    1.0094705
## NGT_PAL_48h_i04     1 34527762    1.0618532
## NGT_CTL_12h_i12     1 19546345    1.1077173
## NGT_CTL_24h_i12     1 16527449    1.0534901
## NGT_CTL_30h_i12     1 15444182    0.9484654
## NGT_CTL_36h_i12     1 22147715    1.1458765
## NGT_CTL_42h_i12     1 28182171    1.0647247
## NGT_CTL_48h_i12     1 15415991    0.9827823
## NGT_CTL_54h_i12     1 44495887    1.0826007
## NGT_PAL_12h_i12     1 22997188    1.1296860
## NGT_PAL_18h_i12     1 29604059    1.0854147
## NGT_PAL_24h_i12     1 22060229    1.0716438
## NGT_PAL_30h_i12     1 21029830    1.1084968
## NGT_PAL_36h_i12     1 19278231    1.1087285
## NGT_PAL_42h_i12     1 21705545    1.0810994
## NGT_PAL_48h_i12     1 16338766    1.0780476
## NGT_PAL_54h_i12     1  4436072    1.1034041
## NGT_CTL_12h_i08     1 24670859    1.1168095
## NGT_CTL_18h_i08     1  5580150    0.7825607
## NGT_CTL_24h_i08     1  6982719    0.8117670
## NGT_CTL_30h_i08     1 22573744    1.1020060
## NGT_CTL_36h_i08     1 20842422    1.0907996
## NGT_CTL_42h_i08     1 24073525    1.0815870
## NGT_CTL_48h_i08     1 18130460    1.0986940
## NGT_CTL_54h_i08     1 19854305    1.0801377
## NGT_PAL_12h_i08     1  9547377    0.9900916
## NGT_PAL_18h_i08     1  4036522    0.6487874
## NGT_PAL_24h_i08     1 20310196    1.1206482
## NGT_PAL_30h_i08     1 20087212    1.0890550
## NGT_PAL_36h_i08     1 25223743    1.0865235
## NGT_PAL_42h_i08     1 23966624    1.0799293
## NGT_PAL_48h_i08     1 29058724    1.0808401
## NGT_PAL_54h_i08     1 16633350    1.0746095
## NGT_CTL_12h_i09     1  2689917    0.6530354
## NGT_CTL_18h_i09     1  2350634    0.4396630
## NGT_CTL_24h_i09     1  1887582    1.0542493
## NGT_CTL_30h_i09     1  2105259    1.0851253
## NGT_CTL_36h_i09     1  3031446    1.0608248
## NGT_CTL_48h_i09     1  2661393    1.0400073
## NGT_CTL_54h_i09     1  3263035    1.0421360
## NGT_PAL_18h_i09     1  2604589    0.8160849
## NGT_PAL_24h_i09     1  2466478    1.0557682
## NGT_PAL_30h_i09     1  2478222    1.0928952
## NGT_PAL_36h_i09     1  4041538    1.0838642
## NGT_PAL_42h_i09     1  2328277    1.0966406
## NGT_PAL_48h_i09     1  2895658    1.0430843
## NGT_PAL_54h_i09     1  7810186    1.0924415
## NGT_CTL_12h_i10     1 36508830    1.0848488
## NGT_CTL_18h_i10     1 24294820    1.0361947
## NGT_CTL_24h_i10     1 24733255    1.0610117
## NGT_CTL_30h_i10     1 25766554    1.0385967
## NGT_CTL_36h_i10     1 18318736    1.0855516
## NGT_CTL_42h_i10     1 21815103    1.0378338
## NGT_CTL_48h_i10     1 26609950    1.0661739
## NGT_CTL_54h_i10     1 29618191    1.0780131
## NGT_PAL_12h_i10     1 14124206    1.1011739
## NGT_PAL_18h_i10     1 31924492    1.0504017
## NGT_PAL_24h_i10     1 21501881    1.0426263
## NGT_PAL_30h_i10     1 31019274    1.0956458
## NGT_PAL_36h_i10     1 33778392    1.0709843
## NGT_PAL_42h_i10     1 23585355    1.0228580
## NGT_PAL_48h_i10     1 28601988    1.0791196
## NGT_CTL_12h_i11     1  7261501    1.0672419
## NGT_CTL_18h_i11     1  3458015    0.5320985
## NGT_CTL_24h_i11     1 19515191    1.0599451
## NGT_CTL_30h_i11     1 15598311    1.0674345
## NGT_CTL_36h_i11     1 11805126    1.0683556
## NGT_CTL_42h_i11     1 12724085    1.0365524
## NGT_CTL_48h_i11     1 10172729    1.0829050
## NGT_CTL_54h_i11     1 18652374    1.0816149
## NGT_PAL_18h_i11     1  1804117    0.8626504
## NGT_PAL_24h_i11     1  1528284    0.8710225
## NGT_PAL_30h_i11     1 20986095    1.0611458
## NGT_PAL_36h_i11     1 15420959    1.1129617
## NGT_PAL_42h_i11     1 17898659    1.0795709
## NGT_PAL_48h_i11     1 15333605    1.1034750
## NGT_PAL_54h_i11     1 26818448    1.0794320

The CPM:

norm_cpm <- cpm(
  dge_filtered,
  log = FALSE,
  prior.count = 1
)

dim(norm_cpm)
## [1] 16230   105

We use the same helper function defined before to inspect result:

## ---- post-qc-timepoint-inspection-normalized ----

# Use normalized CPM matrix instead of raw counts
counts_norm <- norm_cpm

# Confirm sample names
samples <- colnames(counts_norm)

# Split normalized counts by timepoint
counts_by_timepoint <- split_by_timepoint(counts_norm)

# Inspect per-timepoint distributions (boxplots / density)
inspect_timepoint_counts(counts_by_timepoint)
## Inspecting time point: 12h
## Dimensions: 16230 12 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      2.42     13.75     61.81     43.83 414111.00

## Inspecting time point: 18h
## Dimensions: 16230 13 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      2.57     14.08     83.50     43.75 846902.01

## Inspecting time point: 24h
## Dimensions: 16230 13 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 2.525e+00 1.427e+01 6.096e+01 4.416e+01 2.362e+05

## Inspecting time point: 30h
## Dimensions: 16230 14 
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     2.56    14.27    58.57    43.87 99541.71

## Inspecting time point: 36h
## Dimensions: 16230 14 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00      2.56     14.08     60.89     43.48 502972.68

## Inspecting time point: 42h
## Dimensions: 16230 13 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     2.502    14.178    59.051    44.039 68118.402

## Inspecting time point: 48h
## Dimensions: 16230 14 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 2.514e+00 1.424e+01 5.864e+01 4.406e+01 1.191e+05

## Inspecting time point: 54h
## Dimensions: 16230 12 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     0.000     2.506    14.116    57.902    44.016 80476.632

Figure 3. Distribution of normalized gene expression across time points

Boxplots and density plots show the distribution of normalized expression values (CPM) for each circadian time point. Normalization reduces library size–driven differences and yields comparable expression distributions across samples and time points. The consistency of distributions indicates successful normalization and suggests that remaining variation reflects biological effects rather than technical artifacts.

3.4 Exploratory MDS analysis

The MDS plots in different grouping rules:

library(RColorBrewer)
## ---- align-metadata ----

# Extract sample names actually present in dge_filtered
dge_samples <- colnames(dge_filtered)

# Reorder and subset metadata
meta <- sample_data[match(dge_samples, sample_data$sample), ]

MDS 1 — colored by treatment (CTL vs PAL)

treatment_cols <- brewer.pal(
  length(unique(meta$treatment)),
  "Set2"
)
## Warning in brewer.pal(length(unique(meta$treatment)), "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
names(treatment_cols) <- unique(meta$treatment)

plotMDS(
  dge_filtered,
  col = treatment_cols[meta$treatment],
  labels = meta$treatment,
  main = "MDS (colored by treatment)"
)

legend(
  "topright",
  legend = names(treatment_cols),
  col = treatment_cols,
  pch = 16,
  bty = "n"
)

Figure 4. Multidimensional scaling (MDS) plot colored by treatment

Each point represents a sample positioned based on the leading log-fold-change distances of normalized gene expression. Samples are colored by treatment condition (PAL vs CTL). Separation along the primary dimensions indicates that palmitate treatment is a major source of transcriptomic variation.

MDS 2 — colored by timepoint

time_cols <- colorRampPalette(brewer.pal(8, "Dark2"))(
  length(unique(meta$timepoint))
)
names(time_cols) <- unique(meta$timepoint)

plotMDS(
  dge_filtered,
  col = time_cols[meta$timepoint],
  labels = meta$timepoint,
  main = "MDS (colored by timepoint)"
)

legend(
  "topright",
  legend = names(time_cols),
  col = time_cols,
  pch = 16,
  bty = "n"
)

Figure 5. Multidimensional scaling (MDS) plot colored by time point

Samples are colored according to circadian time point. Partial ordering along the MDS axes reflects temporal structure in gene expression, consistent with circadian regulation across the time course.

MDS 3 — colored by biological replicate (i0x)

rep_cols <- brewer.pal(
  length(unique(meta$replicate)),
  "Set3"
)
names(rep_cols) <- unique(meta$replicate)

plotMDS(
  dge_filtered,
  col = rep_cols[meta$replicate],
  labels = meta$replicate,
  main = "MDS (colored by biological replicate)"
)

legend(
  "topright",
  legend = names(rep_cols),
  col = rep_cols,
  pch = 16,
  bty = "n"
)

Figure 6. Multidimensional scaling (MDS) plot colored by biological replicate

Samples are colored by biological replicate (i0x). Replicates do not form dominant clusters, indicating that replicate-to-replicate variability is smaller than treatment- or time-related effects.

3.5 Store the Result

## ---- save-qc-objects ----

saveRDS(
  list(
    dge_filtered = dge_filtered,
    meta         = meta,
    design       = design,
    norm_cpm     = norm_cpm,
    mappings     = gene_map
  ),
  file = "data/d1_qc_objects.rds"
)

4. Differential Expression Analysis

4.1 Model design and contrast specification

Our previous step showed the data has:

  1. Primary separation: circadian timepoint
  2. Secondary separation: palmitate treatment
  3. Minor variation: biological replicate (i0x)

In order to hypothessices: Palmitate causes a consistent shift in expression across all timepoints

Use additive design: expression ~ timepoint + treatment.

obj <- readRDS("data/d1_qc_objects.rds")

dge <- obj$dge_filtered
meta <- obj$meta
mappings <- obj$mappings

Recreate design matrix and factors

meta$timepoint <- factor(meta$timepoint)
meta$treatment <- factor(meta$treatment, levels = c("CTL", "PAL"))
design <- model.matrix(~ timepoint + treatment, data = meta)
colnames(design)
## [1] "(Intercept)"  "timepoint18h" "timepoint24h" "timepoint30h" "timepoint36h"
## [6] "timepoint42h" "timepoint48h" "timepoint54h" "treatmentPAL"

4.2 edgeR quasi-likelihood testing

Fit the model for calculate dffrential expression use edgeR quasi-likelihood

The BCV plot:

dge <- estimateDisp(dge, design)

fit <- glmQLFit(dge, design)

plotBCV(dge)

Figure X. Biological Coefficient of Variation (BCV) plot

The BCV plot shows gene-wise biological variability as a function of average expression. Lowly expressed genes exhibit higher relative variability, while dispersion stabilizes for more highly expressed genes. The smooth trend indicates successful dispersion estimation, supporting the use of edgeR quasi-likelihood testing for differential expression analysis.

Hypothesis testing

qlf_pal_vs_ctl <- glmQLFTest(
  fit,
  coef = "treatmentPAL"
)
topTags(qlf_pal_vs_ctl)
## Coefficient:  treatmentPAL 
##             logFC   logCPM         F       PValue          FDR
## ANGPTL4 1.9734157 5.897820 137.60501 1.911157e-20 3.101808e-16
## TMEM135 0.6889545 4.675120 109.68699 1.009674e-17 8.193506e-14
## PDK4    1.8564969 5.005888  64.40425 2.114638e-12 1.144019e-08
## ACSL3   0.6178718 7.760994  58.14521 1.498943e-11 6.081963e-08
## FADS1   1.1060848 7.323202  55.79662 3.199997e-11 1.038719e-07
## SCD     1.3573818 8.769589  48.96817 3.111944e-10 8.417809e-07
## PLIN2   1.3442054 8.636022  47.92345 4.450411e-10 1.031860e-06
## KLF11   0.5003193 3.901985  44.08804 1.703516e-09 2.879579e-06
## DNAJB9  0.7326566 5.688206  44.03840 1.725381e-09 2.879579e-06
## INSIG1  0.7236491 7.461913  43.95778 1.774232e-09 2.879579e-06
res_pal <- topTags(qlf_pal_vs_ctl, n = Inf)$table %>%
  tibble::rownames_to_column("ensembl")

Reporting raw P value gives how many genes are significant beforen accounting multiple hypothesis.

sum(res_pal$PValue < 0.05)
## [1] 1223

P-values for differential expression were calculated for all genes using the edgeR quasi-likelihood framework, which models count variability using a negative binomial distribution and provides robust inference for RNA-seq data. Genes were considered significantly differentially expressed if they met a threshold of raw p-value < 0.05, a commonly used criterion for initial significance screening in exploratory transcriptomic analyses.

Using this threshold, 1,223 genes were identified as significantly differentially expressed between palmitate-treated and control conditions. This unadjusted p-value cutoff was chosen to quantify the overall extent of transcriptional change prior to multiple-testing correction, which is addressed separately to control the false discovery rate.

4.3 Multiple hypothesis correction

Here we apply:

p-value correction: Benjamini–Hochberg FDR

This Controls false discovery rate under thousands of tests to encounter the multiple hypothesis factor.

sig_res <- res_pal %>%
  filter(FDR < 0.05, abs(logFC) >= 1)

nrow(sig_res)
## [1] 14

We use additional Effect size filter: |log2FC| ≥ 1 to exclude biologically trivial changes.

The result gives 14 genes of interest that pass P value.

Because thousands of genes were tested simultaneously, raw p-values were corrected to control for multiple hypothesis testing. We used the Benjamini–Hochberg false discovery rate (FDR) correction, as implemented by edgeR, which is the standard approach for RNA-seq differential expression analysis. FDR control is preferred over more conservative methods (e.g. Bonferroni) because it balances sensitivity and specificity by limiting the expected proportion of false positives among declared significant genes.

Genes were considered significantly differentially expressed after correction if they satisfied both:

  • FDR < 0.05, and

  • |log₂ fold change| ≥ 1, to ensure biological relevance in addition to statistical significance.

Applying these criteria, 14 genes passed multiple-testing correction and were identified as significantly differentially expressed between palmitate-treated and control conditions.

4.4 volcano plot

plot the volcano plot. Here since 14 gene selected, all are highlighted.

library(ggplot2)

volcano_df <- res_pal %>%
  mutate(
    sig = FDR < 0.05 & abs(logFC) >= 1
  )

ggplot(volcano_df, aes(logFC, -log10(PValue))) +
  geom_point(aes(color = sig), alpha = 0.6) +
  scale_color_manual(values = c("grey70", "red")) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  theme_minimal() +
  labs(
    title = "edgeR QL: Palmitate vs Control",
    x = "log2 fold-change",
    y = "-log10(p-value)"
  )

map_df <- mappings %>%
  transmute(
    ensembl = as.character(ensembl_gene_id),
    symbol  = as.character(hgnc_symbol)
  ) %>%
  distinct(ensembl, .keep_all = TRUE)

sig_res_annot <- sig_res %>%
  left_join(map_df, by = c("ensembl" = "ensembl")) %>%
  mutate(label = ifelse(is.na(symbol) | symbol == "", ensembl, symbol))

Figure 9. Volcano plot showing differential gene expression between palmitate-treated (PAL) and control (CTL) skeletal muscle myotubes. Each point represents a gene, plotted by log2 fold change (logFC) on the x-axis and −log10(p-value) on the y-axis. Genes passing multiple-testing correction (FDR < 0.05) are highlighted. Vertical dashed lines indicate |logFC| = 1, and the horizontal dashed line marks the nominal p-value threshold (p = 0.05). This plot summarizes both the magnitude and statistical significance of differential expression following palmitate exposure.

4.5 Heatmap of significant genes

Plot the heatmap:

logcpm <- cpm(dge, log = TRUE, prior.count = 1)

hm <- logcpm[sig_res_annot$ensembl, ]
rownames(hm) <- make.unique(sig_res_annot$label)

hm_z <- t(scale(t(hm)))
library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.24.1
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##     genomic data. Bioinformatics 2016.
## 
## 
## The new InteractiveComplexHeatmap package can directly export static 
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
library(circlize)
## ========================================
## circlize version 0.4.17
## CRAN page: https://cran.r-project.org/package=circlize
## Github page: https://github.com/jokergoo/circlize
## Documentation: https://jokergoo.github.io/circlize_book/book/
## 
## If you use it in published research, please cite:
## Gu, Z. circlize implements and enhances circular visualization
##   in R. Bioinformatics 2014.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(circlize))
## ========================================
ha <- HeatmapAnnotation(
  treatment = meta$treatment,
  timepoint = meta$timepoint
)

Heatmap(
  hm_z,
  name = "Z-score",
  top_annotation = ha,
  col = colorRamp2(c(-2, 0, 2), c("blue", "white", "red")),
  show_row_names = TRUE,
  row_names_gp = grid::gpar(fontsize = 8),
  show_column_names = FALSE,
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  row_title = "Significant genes (PAL vs CTL)"
)

Figure 10. Heatmap of genes significantly differentially expressed between palmitate-treated (PAL) and control (CTL) samples (FDR < 0.05, |logFC| ≥ 1). Expression values are shown as row-wise z-scores of log-transformed CPM values to emphasize relative expression patterns across samples. Rows correspond to genes and columns to individual samples. Samples are annotated by treatment condition and time point. Hierarchical clustering was applied to both genes and samples. Despite the presence of multiple circadian time points, palmitate-treated samples largely cluster together, indicating that treatment effects dominate over time-dependent variation, with limited overlap between conditions.

A heatmap of the significantly differentially expressed genes (FDR < 0.05, |log₂FC| ≥ 1) was generated using row-wise z-scored normalized expression values. Hierarchical clustering was applied to both genes and samples.

Despite the presence of multiple time points, which introduce an additional source of biological variation, samples largely cluster according to treatment status. Palmitate-treated samples tend to group together and are separated from control samples, indicating that lipid exposure induces a consistent transcriptional response across time.

A small number of samples show partial mixing between treatment and control clusters. This is expected given the strong circadian structure of the dataset, where time point effects can partially overlap with treatment-driven expression patterns. Nonetheless, the dominant clustering by treatment suggests that the palmitate effect is robust and detectable even in the presence of temporal variation.

Overall, the heatmap supports the differential expression results by showing coherent treatment-associated expression patterns, while also reflecting the complex, time-dependent nature of circadian transcriptomic data.

5. Discussion and dataset characteristics

Why is the dataset of interest to you?

This dataset is of interest because it directly investigates how acute lipid overload (palmitate exposure) alters circadian gene expression in human skeletal muscle cells. Circadian regulation is tightly linked to metabolic health, and understanding how saturated fatty acids perturb rhythmic transcription provides mechanistic insight into metabolic dysregulation associated with obesity and insulin resistance. The dataset combines a controlled experimental perturbation with dense temporal sampling, making it well suited for transcriptomic analysis.

What are the control and test conditions of the dataset?

The dataset contains two primary experimental conditions:

  • Control (CTL): primary human skeletal muscle myotubes cultured under standard conditions.

  • Treatment (PAL): matched myotube cultures exposed to the saturated fatty acid palmitate.

Both conditions were sampled across multiple circadian time points following synchronization.

How many samples are in each condition?

The dataset consists of 105 samples in total, all derived from individuals with normal glucose tolerance (NGT):

  • Control (NGT_CTL): samples collected at multiple time points across biological replicates.

    Treatment (NGT_PAL): samples collected at the same set of time points and replicates.

Exact sample counts per condition and time point were verified directly from sample metadata and column name parsing during data inspection. However, one replicate was measured multiple time points, by grouping them in each time point, we can see 5-7 samples per timepoint.

Were there expression values that were not unique for specific genes? How were these handled?

Yes. Multiple rows mapped to the same HGNC gene symbol, reflecting duplicated Ensembl gene IDs, transcript variants, or historical annotation artifacts. To handle this:

  • HGNC symbols were used as the primary gene identifiers.

  • When multiple rows mapped to the same symbol, unique row names were generated by appending an index, preserving all expression measurements while avoiding row name conflicts.

  • No expression values were averaged or discarded at this stage to avoid introducing bias.

Were there expression values that could not be mapped to current HUGO symbols?

Yes. A small subset of Ensembl gene IDs could not be mapped to current HGNC symbols, likely due to deprecated identifiers, non-coding RNAs, or outdated annotations. These genes were retained using their Ensembl IDs as fallback identifiers to preserve expression information and maintain reproducibility.

Were there any outliers in the dataset? How were they handled?

No samples were removed as outliers in this analysis. Exploratory quality control (library size distributions, boxplots, density plots, and MDS plots) did not reveal samples with aberrant behavior indicative of technical failure. The originating study does not report explicit sample removal. Given the use of robust statistical methods (edgeR quasi-likelihood framework), no manual outlier removal was performed.

How did you handle replicates?

Biological replicates were retained as independent samples throughout the analysis. Replicate structure was encoded in the design matrix and accounted for during normalization and differential expression modeling. No technical replicates were collapsed, allowing edgeR to properly estimate biological variability.

What is the final coverage of your dataset?

After low-count gene filtering and annotation:

  • The dataset retained the vast majority of expressed genes, with lowly expressed features removed using a minimum count and minimum sample threshold.

  • Differential expression analysis identified 1,223 genes with nominal p-values < 0.05, and 14 genes passing multiple hypothesis correction (FDR < 0.05, |log₂FC| ≥ 1).